Abstract

This paper presents the results of an em- 
pirical investigation of temporal reference 
resolution in scheduling dialogs. The algo- 
rithm adopted is primarily a linear-recency 
based approach that does not include a 
model of global focus. A fully automatic 
system has been developed and evaluated 
on unseen test data with good results. This 
paper presents the results of an intercoder 
reliability study, a model of temporal reference
resolution that supports linear recency
and has very good coverage, the results of 
the system evaluated on unseen test data, 
and a detailed analysis of the dialogs as- 
sessing the viability of the approach. 

1 Introduction 

Temporal information is often a significant part of 
the meaning communicated in dialogs and texts, but 
is often left implicit, to be recovered by the listener 
or reader from the surrounding context. Determin- 
ing all of the temporal information that is being 
conveyed can be important for many interpretation 
tasks. For instance, in machine translation, know- 
ing the temporal context is important in translating 
sentences with missing information. This is partic- 
ularly useful when dealing with noisy data, as with 
spoken input (Levin et al. 1995). In the following 
example, the third utterance could be interpreted in 
three different ways. 

s1: (Ahora son las once y diez)
    Now it is eleven ten
s1: (Que tal a las doce)
    How about twelve
s1: (Doce a dos)
    Twelve to two
    or The twelfth to the second
    or The twelfth at two

By maintaining the temporal context (i.e., the 5th 
of March 1993 at 12:00), the system will know that
"12:00 to 2:00" is a more probable interpretation
than "the 12th at 2:00". 

In addition, maintaining the temporal context 
would be useful for information extraction tasks 
dealing with natural language texts such as memos 
or meeting notes. For instance, it can be used to 
resolve relative time expressions, so that absolute 
dates can be entered in a database with a uniform 
representation. 

This paper presents the results of an empiri- 
cal investigation of temporal reference resolution in 
scheduling dialogs (i.e., dialogs in which participants 
schedule a meeting with one another). This work 
thus describes how to identify temporal information 
that is missing due to ellipsis or anaphora, and it 
shows how to determine the times evoked by deictic 
expressions. In developing the algorithm, our ap- 
proach was to start with a straightforward, recency- 
based approach and add complexity as needed to 
address problems encountered in the data. The al- 
gorithm does not include a mechanism for handling 
global focus (Grosz & Sidner 1986), for centering 
within a discourse segment (Sidner 1979; Grosz et al. 
1995), or for performing tense and aspect interpreta- 
tion. Instead, the algorithm processes anaphoric ref- 
erences with respect to an Attentional State (Grosz 
& Sidner 1986) structured as a linear list of all times 
mentioned so far in the current dialog. The list is 
ordered by recency, no entries are ever deleted from 
the list, and there is no restriction on access. The al- 
gorithm decides among candidate antecedents based 
on a combined score reflecting recency, a priori pref- 
erences for the type of anaphoric relation(s) estab- 
lished, and plausibility of the resulting temporal ref- 
erence. In determining the candidates from which 
to choose the antecedent, for each type of anaphoric 
relation, the algorithm considers only the most re- 
cent antecedent for which that relationship can be 
established. 

The algorithm was primarily developed on a cor- 
pus of Spanish dialogs collected under the JANUS 
project (Shum et al. 1994) (referred to hereafter as 
the "CMU dialogs") and has also been applied to a
corpus of Spanish dialogs collected under the Artwork
project (Wiebe et al. 1996) (hereafter referred
to as the "NMSU dialogs"). In both cases, subjects 
were told that they were to set up a meeting based 
on schedules given to them detailing their commit- 
ments. The CMU protocol is akin to a phone conver- 
sation between people who do not know each other. 
Such strongly task-oriented dialogs would arise in 
many useful applications, such as automated infor- 
mation providers and automated phone operators. 
The NMSU data are face-to-face dialogs between 
people who know each other well. These dialogs 
are also strongly task-oriented, but only in these, 
not in the CMU dialogs, do the participants stray 
significantly from the scheduling task. In addition, 
the data sets are challenging in that they both in- 
clude negotiation, both contain many disfluencies, 
and both show a great deal of variation in how dates 
and times are discussed. 

To support the computational work, the temporal 
references in the corpus were manually annotated ac- 
cording to explicit coding instructions. In addition, 
we annotated the seen training dialogs for anaphoric 
chains, to support analysis of the data. 

A fully automatic system has been developed that 
takes as input the ambiguous output of a semantic 
parser (Lavie & Tomita 1993, Levin et al. 1995). 
The system performs well on unseen, held-out test
data, especially on the CMU data (81% accuracy and
92% precision; see section 5), showing the usefulness
of our straightforward approach. Performance on the
NMSU data is lower (69% accuracy and 88% precision)
but surprisingly comparable, given the greater complexity
of the data and the fact that the system was primarily
developed on the simpler data.

Rose et al. (1995), Alexandersson et al. (1997), 
and Busemann et al. (1997) describe other recent 
NLP systems that resolve temporal expressions in 
scheduling dialogs as part of their overall process- 
ing, but they do not give results of system perfor- 
mance on any temporal interpretation tasks. Kamp 
& Reyle (1993) address many representational and 
processing issues in the interpretation of temporal 
expressions, but they do not attempt coverage of a 
data set or present results of a working system. To 
our knowledge, there are no other published results 
on unseen test data of systems performing the same 
temporal resolution tasks. 

The specific contributions of this paper are the
following. The results of an intercoder reliability
study involving naive subjects are presented
(in section 2) as well as an abstract presentation
of a model of temporal reference resolution
(in section 3). In addition, the high-level algorithm
is given (in section 4); the fully refined algorithm,
which distinguishes many more subcases
than can be presented here, is available online
at http://crl.nmsu.edu/Research/Projects/artwork.
Detailed results of an implemented system are also
presented (in section 5), showing the success of the
algorithm. In the final part of the paper, we abstract
away from matters of implementation and analyze
the challenges presented by the dialogs to an algorithm
that does not include a model of global focus
(in section 6). We found surprisingly few such challenges.

2 The Corpus and Intercoder 
Reliability Study 

Consider this passage from the corpus (translated 
into English): 

Preceding time: Thursday 19 August 

s1  1  On Thursday I can only meet after two pm
    2  From two to four
    3  Or two thirty to four thirty
    4  Or three to five
s2  5  Then how does from two thirty to
       four thirty seem to you
    6  On Thursday
s1  7  Thursday the thirtieth of September

An example of temporal reference resolution is 
that (2) refers to 2-4pm Thursday 19 August. Al- 
though related, this problem is distinct from tense 
and aspect interpretation in discourse (as addressed 
in, e.g., Webber 1988, Song & Cohen 1991, Hwang 
& Schubert 1992, Lascarides et al. 1992, and 
Kameyama et al. 1993). 

Because the dialogs are centrally concerned with 
negotiating an interval of time in which to hold a 
meeting, our representations are geared toward such 
intervals. Our basic representational unit is given in 
figure 1. To avoid confusion, we refer to this basic
unit throughout as a Temporal Unit (TU). 

The time referred to in, for example, "From 2 to 
4, on Wednesday the 19th of August" is represented 
as: 

((August, 19th, Wednesday, 2, pm) 
(August, 19th, Wednesday, 4, pm)) 

Thus, the information from multiple noun phrases 
is often merged into a single representation of the 
underlying interval evoked by the utterance. 

An utterance such as "The meeting starts at 2" is 
represented as an interval rather than as a point in 
time, reflecting the orientation of the coding scheme 
toward intervals. Another issue this kind of utter- 
ance raises is whether or not a speculated ending 
time of the interval should be filled in, using knowl- 
edge of how long meetings usually last. In the CMU 
data, the meetings all last two hours. However, so 
that the instructions will be applicable to a wider 
class of dialogs, we decided to be conservative with 
respect to filling in an ending time, given the starting 
time (or vice versa), leaving it open unless something
in the dialog explicitly suggests otherwise. 

There are cases in which times are considered as
points (e.g., "It is now 3pm"). These are represented
as Temporal Units with the same starting and ending
times (as in Allen (1984)). If just one ending
point is represented, all the fields of the other are
null. And, of course, all fields are null for utterances
that do not contain temporal information. In
the case of an utterance that refers to multiple, distinct
intervals, the representation is a list of Temporal
Units.

((start-month, start-date, start-day-of-week, start-hour&minute, start-time-of-day)
(end-month, end-date, end-day-of-week, end-hour&minute, end-time-of-day))

Figure 1: Temporal Units
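
The Temporal Unit representation can be illustrated with a minimal Python sketch. The class names, field names, and the string encoding of hours below are assumptions made for illustration only; they are not taken from the implemented system.

```python
from dataclasses import astuple, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EndPoint:
    """One end point of a Temporal Unit; None stands for a null field."""
    month: Optional[str] = None
    date: Optional[int] = None
    day_of_week: Optional[str] = None
    hour_minute: Optional[str] = None
    time_of_day: Optional[str] = None

    def is_null(self) -> bool:
        # An end point is null when none of its fields has a filler.
        return all(value is None for value in astuple(self))


@dataclass(frozen=True)
class TemporalUnit:
    """A starting and an ending end point (hypothetical encoding)."""
    start: EndPoint = field(default_factory=EndPoint)
    end: EndPoint = field(default_factory=EndPoint)


# "From 2 to 4, on Wednesday the 19th of August"
interval = TemporalUnit(
    start=EndPoint("August", 19, "Wednesday", "2:00", "pm"),
    end=EndPoint("August", 19, "Wednesday", "4:00", "pm"),
)

# "It is now 3pm": a point is a unit whose start and end coincide.
now = EndPoint(hour_minute="3:00", time_of_day="pm")
point = TemporalUnit(start=now, end=now)

# "The meeting starts at 2": the ending time is left open (all fields null).
open_ended = TemporalUnit(start=EndPoint(hour_minute="2:00"))
```

An utterance that refers to several distinct intervals would then simply be represented as a list of such objects.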

A Temporal Unit is also the representation used 
in the evaluation of the system. That is, the system's
answers are mapped from its more complex
internal representation (an ILT, see section 4.1) into
this simpler vector representation before evaluation 
is performed. 

As in much recent empirical work in discourse processing
(e.g., Ahrenberg et al. 1995; Isard & Carletta
1995; Litman & Passonneau 1995; Moser & Moore 
1995; Hirschberg & Nakatani 1996), we performed 
an intercoder reliability study investigating agree- 
ment in annotating the times. The goal in devel- 
oping the annotation instructions is that they can 
be used reliably by non-experts after a reasonable 
amount of training (cf. Passonneau & Litman 1993, 
Condon & Cech 1995, and Hirschberg & Nakatani 
1996), where reliability is measured in terms of the 
amount of agreement among annotators. High reliability
indicates that the encoding scheme is reproducible
given multiple labelers. In addition, the
instructions serve to document the annotations. 

The subjects were three people with no previous 
involvement in the project. They were given the 
original Spanish and the English translations. How- 
ever, as they have limited knowledge of Spanish, in 
essence they annotated the English translations. 

The subjects annotated two training dialogs ac- 
cording to the instructions. After receiving feed- 
back, they annotated four unseen test dialogs. Inter- 
coder reliability was assessed using Cohen's Kappa 
statistic (κ) (Siegel & Castellan 1988, Carletta
1996). 

κ is calculated as follows, where the numerator is
the average percentage agreement among the annotators
(Pa) less a term for chance agreement (Pe),
and the denominator is 100% agreement less the
same term for chance agreement (Pe):

$$ \kappa = \frac{P_a - P_e}{1 - P_e} $$



(For details on calculating Pa and Pe see Siegel &
Castellan 1988). As discussed in (Hays 1988), κ will
be 0.0 when the agreement is what one would expect
under independence, and it will be 1.0 when
the agreement is exact. A κ value of 0.8 or greater
indicates a high level of reliability among raters, with
values between 0.67 and 0.8 indicating only moderate
agreement (Hirschberg & Nakatani 1996; Carletta
1996).
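
The calculation can be sketched in Python as follows. This is a minimal illustration that assumes the multi-coder formulation in which Pa is the proportion of agreeing coder pairs averaged over items and Pe is the sum of squared pooled category proportions; the function name and the toy labels are hypothetical, and the exact formulation used in the study should be taken from Siegel & Castellan (1988).

```python
from collections import Counter
from itertools import combinations
from typing import Hashable, List, Sequence


def kappa(annotations: Sequence[Sequence[Hashable]]) -> float:
    """annotations[i][c] is the label that coder c assigned to item i."""
    n_items = len(annotations)
    n_coders = len(annotations[0])
    pairs = list(combinations(range(n_coders), 2))

    # Pa: fraction of agreeing coder pairs, averaged over all items.
    p_a = sum(
        sum(item[i] == item[j] for i, j in pairs) / len(pairs)
        for item in annotations
    ) / n_items

    # Pe: chance agreement from the pooled distribution of labels.
    counts = Counter(label for item in annotations for label in item)
    total = n_items * n_coders
    p_e = sum((count / total) ** 2 for count in counts.values())

    if p_e == 1.0:  # every coder used a single label throughout
        return 1.0
    return (p_a - p_e) / (1.0 - p_e)


# Three coders labelling the start day-of-week field of four utterances.
labels: List[List[str]] = [
    ["Mon", "Mon", "Mon"],
    ["Tue", "Tue", "Tue"],
    ["null", "null", "null"],
    ["Thu", "Thu", "Thu"],
]
print(kappa(labels))  # exact agreement gives 1.0
```

Under this formulation, agreement no better than chance yields 0.0 and exact agreement yields 1.0, matching the interpretation given above.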

In addition to measuring intercoder reliability, we 
compared each coder's annotations to the evaluation 
Temporal Units used to assess the system's perfor- 
mance. These evaluation Temporal Units were as- 
signed by an expert working on the project. 

The agreement among coders (κ) is shown in table
1. In addition, this table shows the average pairwise
agreement of the coders and the expert (κ_avg), which
was assessed by averaging the individual κ scores
(not shown). There is a moderate or high level of
agreement among annotators in all cases except the 
ending time of day, a weakness we are investigating. 
Similarly, there are reasonable levels of agreement 
between our evaluation Temporal Units and the an- 
swers the naive coders provided. 

Busemann et al. (1997) also annotate temporal 
information in a corpus of scheduling dialogs. How- 
ever, their annotations are at the level of individ- 
ual expressions rather than at the level of Temporal 
Units, and they do not present the results of an in- 
tercoder reliability study. 



3 Model 

This section presents our model of temporal ref- 
erence in scheduling dialogs. The treatment of 
anaphora in this paper is as a relationship between a 
Temporal Unit representing a time evoked in the cur- 
rent utterance, and one representing a time evoked 
in a previous utterance. The resolution of the 
anaphor is a new Temporal Unit that represents the 
interpretation of the contributing words of the cur- 
rent utterance. 

Fields of Temporal Units are partially ordered as
in figure 2, from least to most specific.

In all cases below, after the resolvent has been 
formed, it is subjected to highly accurate, trivial in- 
ference to produce the final interpretation (e.g., fill- 
ing in the day of the week given the month and the 
date). 

The cases of non-anaphoric reference: 

1. A deictic expression is resolved into a time interpreted
with respect to the dialog date (e.g.,
"Tomorrow", "last week"). (See rule NA1 in
section 4.2.)

Table 1: Agreement among Coders (kappa coefficients by field)

Figure 2: Specificity Ordering



2. A forward time is calculated by using the dialog 
date as a frame of reference. 
Let F be the most specific field in TU_current
above the level of time-of-day.
Resolvent: The next F after the dialog date,
augmented with the fillers of the fields in
TU_current at or below the level of time-of-day.
(See rule NA2.)

For both this and anaphoric relation (3), there
are subcases for whether the starting and/or 
ending times are involved. Note that tense can 
influence the choice of whether to calculate a 
forward or a backward time from a frame of 
reference (Kamp & Reyle 1993), but we do not 
account for this in our model due to the lack of 
tense variation in the corpora. 

Ex: Dialog date is Mon, 19th, Aug 
"How about Wednesday at 2?" 
interpreted as 2 pm, Wed 21 Aug 

The cases of anaphora considered: 

1. The utterances evoke the same time, or the second
is more specific than the first.
Resolvent: the union of the information in the
two Temporal Units. (See rule A1.)

Ex: "How is Tuesday, January 30th?"
"How about 2?"

(See also (1)-(2) of the corpus example.)



2. The current utterance evokes a time that includes
the time evoked by a previous utterance, and
the current time is less specific. (See rule A2.)

Let F be the most specific field in TU_current.
Resolvent: All of the information in TU_previous
from F on up.

Ex: "How about Monday at 2?" 

resolved to 2pm, Mon 19 Aug 
"Ok, well, Monday sounds good." 

(See also (5)-(6) in the corpus example.) 

3. This is the same as non-anaphoric case (2)
above, but the new time is calculated with respect
to TU_previous instead of the dialog date.
(See rule A3.)

Ex: "How about the 3rd week in August?" 
"Let's see, Monday sounds good." 
interpreted as Mon, 3rd week in Aug 

Ex: "Would you like to meet Wed, Aug 2nd?" 
"No, how about Friday at 2." 

interpreted as Fri, Aug 4 at 2pm 

4. The current time is a modification of the previous
time; the times are consistent down to some
level of specificity X and differ in the filler of X.
Resolvent: The information in TU_previous above
level X together with the information in
TU_current at and below level X. (See rule
A4.)



Ex: "Monday looks good." 

resolved to Mon 19 Aug 
"How about 2?" 

resolved to 2pm Mon 19 Aug 
"Hmm, how about 4?" 

resolved to 4pm Mon 19 Aug 

(See also (3)-(5) in the example from the cor- 
pus.) 

Although we found domain knowledge and task- 
specific linguistic conventions most useful, we ob- 
served in the NMSU data some instances of poten- 
tially exploitable syntactic information to pursue in 
future work (Grosz et al. 1995, Sidner 1979). For 
example, "until" in the following suggests that the 
first utterance specifies an ending time. 

"... could it be until around twelve?" 
"12:30 there" 

A preference for parallel syntactic roles might be 
used to recognize that the second utterance speci- 
fies an ending time too. 

4 The Algorithm 

This section presents our algorithm for tempo- 
ral reference resolution. After a brief overview, 
the rule-application architecture is described and 
then the rules composing the algorithm are given. 
As mentioned earlier, this is a high-level algorithm.
A description of the complete algorithm,
including a specification of the normalized input
representation (see section 4.1), can be obtained
from a report available at the project web page
(http://crl.nmsu.edu/Research/Projects/artwork).

There is a rule for each of the relations presented
in section 3. Those for the anaphoric relations involve
various applicability conditions on the current
utterance and a potential antecedent. For the cur- 
rent not-yet-resolved Temporal Unit, each rule is ap- 
plied. For the anaphoric rules, the antecedent con- 
sidered is the most recent one meeting the condi- 
tions. All consistent maximal mergings of the results 
are formed, and the one with the highest score is the 
chosen interpretation. 



4.1 Architecture 

Following (Qu et al. 1996) and (Shum et al. 1994), 
the representation of a single utterance is called an 
ILT (for InterLingual Text). An ILT, once it has
been augmented by our system with temporal (and 
speech-act) information, is called an augmented ILT 
(an AILT). The input to our system, produced by a 
semantic parser (Shum et al. 1994; Lavie & Tomita 
1993), consists of multiple alternative ILT repre- 
sentations of utterances. To produce one ILT, the 
parser maps the main event and its participants into 
one of a small set of case frames (for example, a meet 
frame or an is busy frame) and produces a surface 
representation of any temporal information, which is
faithful to the input utterance. Although the events
and states discussed in the NMSU data are often 
outside the coverage of this parser, the temporal in- 
formation generally is not. Thus, the parser pro- 
vides us with a sufficient input representation for 
our purposes on both sets of data. This parser is 
proprietary, but it would not be difficult to produce 
just the portion of the temporal information that 
our system requires. 

Because the input consists of alternative sequences 
of ILTs, the system resolves the ambiguity in 
batches. In particular, for each input sequence of 
ILTs, it produces a sequence of AILTs and then 
chooses the best sequence for the corresponding ut- 
terances. In this way, the input ambiguity is resolved 
as a function of finding the best temporal interpreta- 
tions of the utterance sequences in context (as sug- 
gested in Qu et al. 1996). 

A focus list keeps track of what has been discussed 
so far in the dialog. After a final AILT has been 
created for the current utterance, the AILT and the 
utterance are placed together on the focus list (where 
they are now referred to as a discourse entity, or 
DE). In the case of utterances that evoke more than
one Temporal Unit, a separate entity is added for 
each to the focus list in order of mention. 

Otherwise, the system architecture is similar to a 
standard production system, with one major excep- 
tion: rather than choosing the results of just one of 
the rules that fires (i.e., conflict resolution), multiple 
results can be merged. This is a flexible architec- 
ture that accommodates sets of rules targeting dif- 
ferent aspects of interpretation, allowing the system 
to take advantage of constraints that exist between 
them (for example, temporal and speech act rules). 

Step 1. The input ILT is normalized. In the in- 
put ILT, different pieces of information about the 
same time might be represented separately in order 
to capture relationships among clauses. Our sys- 
tem needs to know which pieces of information are 
about the same time (but does not need to know 
about the additional relationships). Thus, we map 
from the input representation into a normalized form 
that shields the reasoning component from the idiosyncrasies
of the input representation. After the
normalization process, highly accurate, obvious in- 
ferences are made and added to the representation. 

Step 2. All rules are applied to the normalized in- 
put. The result of a rule application is a partial AILT 
(PAILT) — information this rule would contribute to 
the interpretation of the utterance. This informa- 
tion includes a certainty factor representing an a 
priori preference for the type of anaphoric or non- 
anaphoric relation being established. In the case 
of anaphoric relations, this factor gets adjusted by 
a term representing how far back on the focus list
the antecedent is (in rules A1-A4 in section 4.2, the
adjustment is represented by distance_factor in the
calculation of the certainty factor CF). The result of 
this step is the set of PAILTs produced by the rules 
that fired (i.e., those that succeeded). 

Step 3. All maximal mergings of the PAILTs are 
created. Consider a graph in which the PAILTs 
are the vertices, and there is an edge between two 
PAILTs iff the two PAILTs are compatible. Then, 
the maximal cliques of the graph (i.e., the maxi- 
mal complete subgraphs) correspond to the maximal 
mergings. Each maximal merging is then merged 
with the normalized input ILT, resulting in a set of 
AILTs. 

Step 4. The AILT chosen is the one with the high- 
est certainty factor. The certainty factor of an AILT 
is calculated as follows. First, the certainty factors 
of the constituent PAILTs are summed. Then, crit- 
ics are applied to the resulting AILT, lowering the 
certainty factor if the information is judged to be 
incompatible with the dialog state. 

The merging process might have yielded addi- 
tional opportunity for making obvious inferences, so 
that process is performed again, to produce the final 
AILT. 
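
Steps 3 and 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: PAILTs are encoded as dictionaries of field fillers plus a "certainty" entry, two PAILTs are taken to be compatible when they agree on every field they share, critics are modelled as functions returning a penalty, and the merge with the normalized input ILT is omitted. None of these names or encodings comes from the implemented system.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Set, Tuple

PAILT = Dict[str, Any]  # hypothetical: field fillers plus a "certainty" entry


def fields(pailt: PAILT) -> Dict[str, Any]:
    return {k: v for k, v in pailt.items() if k != "certainty"}


def compatible(a: PAILT, b: PAILT) -> bool:
    """Assumed test: the two PAILTs agree on every field they share."""
    fa, fb = fields(a), fields(b)
    return all(fa[k] == fb[k] for k in fa.keys() & fb.keys())


def maximal_mergings(pailts: List[PAILT]) -> List[FrozenSet[int]]:
    """Step 3: maximal cliques of the compatibility graph (Bron-Kerbosch)."""
    n = len(pailts)
    neighbours = {
        i: {j for j in range(n) if j != i and compatible(pailts[i], pailts[j])}
        for i in range(n)
    }
    cliques: List[FrozenSet[int]] = []

    def expand(clique: Set[int], candidates: Set[int], excluded: Set[int]) -> None:
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbours[v], excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return cliques


def choose(pailts: List[PAILT],
           critics: List[Callable[[Dict[str, Any]], float]]) -> Tuple[Dict[str, Any], float]:
    """Step 4: sum the certainty factors, apply critics, keep the best merging."""
    best: Dict[str, Any] = {}
    best_cf = float("-inf")
    for clique in maximal_mergings(pailts):
        merged: Dict[str, Any] = {}
        for i in clique:
            merged.update(fields(pailts[i]))
        cf = sum(pailts[i]["certainty"] for i in clique)
        cf -= sum(critic(merged) for critic in critics)  # critics lower the score
        if cf > best_cf:
            best, best_cf = merged, cf
    return best, best_cf


results = [
    {"date": 19, "certainty": 0.8},
    {"hour": 2, "certainty": 0.4},
    {"date": 20, "certainty": 0.5},
]
interpretation, score = choose(results, critics=[])
```

In this toy run the first and third results conflict on the date, so there are two maximal mergings, and the one containing the first two results is chosen because its summed certainty is higher.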

4.2 Temporal Resolution Rules 

The rules described in this section (see figure 3) apply
to individual temporal units and return either
a more fully specified TU or an empty structure to
indicate failure.

Many of the rules calculate temporal information 
with respect to a frame of reference, using a separate 
calendar utility. The following describe these and 
other functions assumed by the rules below, as well 
as some conventions used. 

next(TimeValue, RF): returns the next
TimeValue that follows reference frame RF.
next(Monday, [...Friday, 19th,...]) = Monday,
22nd.

resolve_deictic(DT, RF): resolves the
deictic term DT with respect to the reference
frame RF.

merge(TU1, TU2): if temporal units TU1 and
TU2 contain no conflicting field fillers, returns a
temporal unit containing all of the information
in the two; otherwise returns {}.

merge_upper(TU1, TU2): like the previous function,
except includes only those field fillers from
TU1 that are of the same or less specificity as
the most specific field filler in TU2.

specificity(TU): returns the specificity of the most
specific field in TU.

starting_fields(TU): returns a list of starting field
names for those in TU having non-null values.

structure→component: returns the named component
of the structure.

conventions: Values are in bold face and variables
are in italics. TU is the current temporal
unit being resolved. TodaysDate is a representation
of the dialog date. FocusList is the list of
discourse entities from all previous utterances.
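
The two calendar-based functions can be illustrated with a short Python sketch built on the standard datetime module. The function name next_weekday, the restriction to day-of-week values, the three-word deictic lexicon, and the year chosen for the example are all assumptions for illustration; the calendar utility used by the system is not described here.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def next_weekday(day_name: str, reference: date) -> date:
    """Return the first date strictly after `reference` that falls on `day_name`."""
    offset = (WEEKDAYS.index(day_name) - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=offset)


# Assumed offsets, in days, for a tiny lexicon of deictic terms.
DEICTIC_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_deictic(term: str, reference: date) -> date:
    """Resolve a deictic term with respect to the reference date."""
    return reference + timedelta(days=DEICTIC_OFFSETS[term.lower()])


# next(Monday, [...Friday, 19th,...]) = Monday, 22nd
friday_19th = date(1994, 8, 19)  # illustrative year in which 19 August is a Friday
print(next_weekday("Monday", friday_19th))       # 1994-08-22
print(resolve_deictic("tomorrow", friday_19th))  # 1994-08-20
```

Because the offset is always between 1 and 7 days, a day name equal to the reference day resolves to the following week rather than to the reference date itself, which is one possible reading of "next".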

The algorithm does not cover a number of sub- 
cases of relations concerning the ending times. For 
instance, rule NA2 covers only the starting-time 
case of non-anaphoric relation 2. An example of an 
ending-time case that is not handled is the utterance 
"Let's meet until Thursday," under the meaning 
that they should meet from today through Thurs- 
day. This is an area for future work. 

5 Results 

As mentioned earlier, the main results are based
on comparisons against human annotation of the
held-out test data. The results are based on straight
field-by-field comparisons of the Temporal Unit representations
introduced in section 2. To be
considered correct, information must not only be
right but also be in the right place. For
example, "Monday" correctly resolved to Monday, 
19th of August, but incorrectly treated as a starting 
rather than an ending time, contributes 3 errors of 
omission and 3 errors of commission (and no credit 
is given for recognizing the date). 

Detailed results for the test sets are presented 
next, starting with results for the CMU data (see 
table 2). Accuracy measures the degree to which
the system produces the correct answers, while pre- 
cision measures the degree to which the system's an- 
swers are correct (see the formulas in the tables). For 
each component of the extracted temporal structure, 
counts were maintained for the number of correct 
and incorrect cases of the system versus the tagged 
file. Since null values occur quite often, these two 
counts exclude cases when one or both of the val- 
ues are null. Instead, additional counts were used 
for those possibilities. Note that each test set con- 
tains three complete dialogs with an average of 72 
utterances per dialog. 

These results show that the system is performing 
with 81% accuracy overall, which is significantly bet- 
ter than the lower bound (defined below) of 43%. In 
addition, the results show a high precision of 92%. 
In some of the individual cases, however, the results 
could be higher due to several factors. For exam- 
ple, our system development was inevitably focussed 
more on some types of slots than others. An obvious 
area for improvement is the time-of-day handling. 
Also, note that the values in the Missing column 
are higher than those in the Extra column. This reflects
the conservative coding convention, mentioned
in section 2, for filling in unspecified end points.



Rules for non-anaphoric relations

Rule NA1: All cases of non-anaphoric relation 1.
  if there is a deictic term, DT, in TU then
    return {[when, resolve_deictic(DT, TodaysDate)], [certainty, 0.9]}

Rule NA2: The starting-time cases of non-anaphoric relation 2.
  if (most_specific(starting_fields(TU)) < time_of_day) then
    Let f be the most specific field in starting_fields(TU)
    return {[when, next(TU→f, TodaysDate)], [certainty, 0.4]}

Rules for anaphoric relations

Rule A1: All cases of anaphoric relation 1.
  for each non-empty temporal unit TU_fl from FocusList (starting with most recent)
    if specificity(TU_fl) < specificity(TU) and not empty merge(TU_fl, TU) then
      CF = 0.8 - distance_factor(TU_fl, FocusList)
      return {[when, merge(TU_fl, TU)], [certainty, CF]}

Rule A2: All cases of anaphoric relation 2.
  for each non-empty temporal unit TU_fl from FocusList (starting with most recent)
    if specificity(TU_fl) > specificity(TU) and not empty merge_upper(TU_fl, TU) then
      CF = 0.5 - distance_factor(TU_fl, FocusList)
      return {[when, merge_upper(TU_fl, TU)], [certainty, CF]}

Rule A3: Starting-time case of anaphoric relation 3.
  if (most_specific(starting_fields(TU)) < time_of_day) then
    for each non-empty temporal unit TU_fl from FocusList (starting with most recent)
      if specificity(TU) > specificity(TU_fl) then
        Let f be the most specific field in starting_fields(TU)
        CF = 0.6 - distance_factor(TU_fl, FocusList)
        return {[when, next(TU→f, TU_fl→start-date)], [certainty, CF]}

Rule A4: All cases of anaphoric relation 4.
  for each non-empty temporal unit TU_fl from FocusList (starting with most recent)
    if specificity(TU) > specificity(TU_fl) then
      TU_temp = TU_fl
      for each {f | f > most specific field in TU}
        TU_temp→f = null
      if not empty merge(TU_temp, TU) then
        CF = 0.5 - distance_factor(TU_fl, FocusList)
        return {[when, merge(TU_temp, TU)], [certainty, CF]}

Figure 3: Main Temporal Resolution Rules
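
Rules A1 and A2 can be sketched in Python as follows. The dictionary encoding of temporal units (only non-null fields are stored), the numeric specificity levels standing in for the ordering of figure 2, and the linear distance_factor are assumptions made for illustration; only the base certainty values 0.8 and 0.5 and the overall control flow are taken from the rules.

```python
from typing import Callable, Dict, List, Optional

TU = Dict[str, object]  # hypothetical encoding: only non-null fields are stored

# Assumed specificity levels, from least (0) to most (3) specific.
LEVEL = {"month": 0, "date": 1, "day_of_week": 1, "time_of_day": 2, "hour_minute": 3}


def specificity(tu: TU) -> int:
    return max((LEVEL[f] for f in tu), default=-1)


def merge(tu1: TU, tu2: TU) -> TU:
    """All information in both units, or {} if any shared field conflicts."""
    if any(tu1[f] != tu2[f] for f in tu1.keys() & tu2.keys()):
        return {}
    return {**tu1, **tu2}


def merge_upper(tu1: TU, tu2: TU) -> TU:
    """Like merge, but keeps only fields of tu1 no more specific than tu2's most specific."""
    limit = specificity(tu2)
    return merge({f: v for f, v in tu1.items() if LEVEL[f] <= limit}, tu2)


def distance_factor(position: int, step: float = 0.05) -> float:
    """Hypothetical penalty that grows with distance back on the focus list."""
    return step * position


def apply_rule(tu: TU, focus_list: List[TU],
               applies: Callable[[TU, TU], bool],
               combine: Callable[[TU, TU], TU],
               base_cf: float) -> Optional[dict]:
    # The focus list is ordered most recent first; the first antecedent
    # meeting the conditions is used.
    for position, tu_fl in enumerate(focus_list):
        if tu_fl and applies(tu_fl, tu) and combine(tu_fl, tu):
            return {"when": combine(tu_fl, tu),
                    "certainty": base_cf - distance_factor(position)}
    return None


def rule_a1(tu: TU, focus_list: List[TU]) -> Optional[dict]:
    return apply_rule(tu, focus_list,
                      lambda prev, cur: specificity(prev) < specificity(cur),
                      merge, 0.8)


def rule_a2(tu: TU, focus_list: List[TU]) -> Optional[dict]:
    return apply_rule(tu, focus_list,
                      lambda prev, cur: specificity(prev) > specificity(cur),
                      merge_upper, 0.5)


# "How is Tuesday, January 30th?" followed by "How about 2?"
tuesday = {"month": "January", "date": 30, "day_of_week": "Tuesday"}
a1 = rule_a1({"hour_minute": "2:00"}, [tuesday])

# "How about Monday at 2?" followed by "Ok, well, Monday sounds good."
monday_2pm = {"month": "August", "date": 19, "day_of_week": "Monday",
              "hour_minute": "2:00", "time_of_day": "pm"}
a2 = rule_a2({"day_of_week": "Monday"}, [monday_2pm])
```

In the first example the hour is added to the antecedent's date information; in the second, only the antecedent's fields at or above the day level are retained.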



Label      Cor  Inc  Mis  Ext  Nul  AccLB  Acc    Prec
start
  Month     49    3    7    3       0.338  0.831  0.891
  Date      48    4    7    3       0.403  0.814  0.873
  WeekDay
end
            44
           323   28   87   17  165  0.428  0.809  0.916



Legend 

Cor(rect): System and key agree on non-null value 

Inc(orrect): System and key differ on non-null value 

Mis(sing): System has null value for non-null key

Ext(ra): System has non-null value for null key 

Nul(l): Both System and key give null answer 



Acc(uracy)LB: accuracy lower bound 

Acc(uracy): percentage of key values matched correctly 
(Correct + Null)/ (Correct + Incorrect + Missing + Null) 
Prec(ision): percentage of System answers matching the key 
(Correct + Null)/ (Correct + Incorrect + Extra + Null) 

Table 2: Evaluation of System on CMU Test Data 
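
The accuracy and precision formulas in the legend can be checked against the counts in the last row of table 2 (323 correct, 28 incorrect, 87 missing, 17 extra, 165 null). The following Python sketch is only a restatement of those formulas; the function names are chosen here for illustration.

```python
def accuracy(cor: int, inc: int, mis: int, nul: int) -> float:
    """(Correct + Null) / (Correct + Incorrect + Missing + Null)"""
    return (cor + nul) / (cor + inc + mis + nul)


def precision(cor: int, inc: int, ext: int, nul: int) -> float:
    """(Correct + Null) / (Correct + Incorrect + Extra + Null)"""
    return (cor + nul) / (cor + inc + ext + nul)


cor, inc, mis, ext, nul = 323, 28, 87, 17, 165
print(round(accuracy(cor, inc, mis, nul), 3))   # 0.809
print(round(precision(cor, inc, ext, nul), 3))  # 0.916
```

Missing values lower accuracy only and extra values lower precision only, which is why a conservative system that leaves fields unspecified shows higher precision than accuracy.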



A system that produces extraneous values is more 
problematic than one that leaves entries unspecified. 

Table 3 contains the results for the system on the
NMSU data. This shows that the system performs
respectably, with 69% accuracy and 88% precision,
on this less constrained set of data. The precision
is still comparable, but the accuracy is lower since
more of the entries were left unspecified. Furthermore,
the lower bound for accuracy (29%) is almost
15 percentage points lower than the one for the CMU
data (43%), supporting the claim that this data set
is more challenging.

More details on the lower bounds for the test data 
sets are shown next (see table 4). These values were
derived by disabling all the rules and just evaluat- 
ing the input as is (after performing normalization, 
so the evaluation software could be applied). Since 
'null' is the most frequent value for all the fields, this 
is equivalent to using a naive algorithm that selects 
the most frequent value for a given field. The right- 
most column shows that there is a small amount of 
error in the input representation. This figure is 1 
minus the precision of the input representation (af- 
ter normalization). Note, however, that this is a 
close but not entirely direct measure of the error in 
the input, because there are a few cases of the nor- 
malization process committing errors and a few of 
it correcting them. Recall that the input is ambiguous;
the figures in table 4 are based on the system
selecting the first ILT in each case. Since the parser 
orders the ILTs based on a measure of acceptability, 
this choice is likely to have the relevant temporal 
information. 

Since the above results are for the system tak- 
ing ambiguous semantic representations as input, 
the evaluation does not isolate focus-related errors. 
Therefore, two tasks were performed to aid in developing
the analysis presented in section 6. First,
anaphoric chains and competing discourse entities 
were manually annotated in all of the seen data. 
Second, to aid in isolating errors due to focus issues, 
the system was evaluated on unambiguous, partially 
corrected input for all the seen data (the test sets 
were retained as unseen test data). 

The overall results are shown in Table 5. This
includes the results described earlier to facilitate
comparisons. Among the first, more constrained
data, there are twelve dialogs in the training data
and three dialogs in a held-out test set. The average
length of each dialog is approximately 65 utterances.
Among the second, less constrained data, there are
four training dialogs and three test dialogs.

As described in the next section, our approach
handles focus effectively. In both data sets, there
are noticeable gains in performance on the seen data
going from ambiguous to unambiguous input, especially
for the NMSU data. Therefore, the ambiguity
in the dialogs contributes much to the errors.

[Table 3: Evaluation of System on NMSU Test Data. Overall accuracy 0.689, precision 0.879.]

[Table 4: Lower Bounds for both Test Sets. cmu: accuracy 0.428, input error 0.055; nmsu: accuracy 0.286, input error 0.029.]

[Table 5: Results on Corrected Input (to isolate focus issues). Columns: seen/unseen, cmu/nmsu, ambiguous/unambiguous, #dialogs, #utterances, Accuracy, Precision. Row: unseen, nmsu, ambiguous, 3, 236, 0.689, 0.879.]

The better performance on the unseen, ambiguous
NMSU data than on the seen, ambiguous NMSU
data has several causes. For instance, there is
vast ambiguity in the seen data. Also, numbers are
mistaken by the input parser for dates (e.g., phone
numbers are treated as dates). In addition, a tense
filter, to be discussed below in section 6, was implemented
to heuristically detect subdialogs, improving
the performance of the seen NMSU ambiguous
dialogs. This filter did not, however, significantly
improve the performance for any of the other data,
suggesting that the targeted kinds of subdialogs do
not occur in the unseen data.

The errors remaining in the seen, unambiguous 
NMSU data are overwhelmingly due to parser er- 
ror, errors in applying the rules, errors in mistaking 
anaphoric references for deictic references (and vice 
versa), and errors in choosing the wrong anaphoric 
relation. As will be shown in the next section, very 
few errors can be attributed to the wrong entities be- 
ing in focus due to not handling subdialogs or "mul- 
tiple threads" (Rose et al. 1995). 

6 Global Focus 

The algorithm is conspicuously lacking in any mech- 
anism for recognizing the global structure of the dis- 
course, such as in Grosz & Sidner (1986), Mann 
& Thompson (1988), Allen & Perrault (1980), and 
their descendants. Recently in the literature, Walker 
(1996) has argued for a more linear-recency based 
model of Attentional State (though not that dis- 
course structure need not be recognized), while Rose 
et al. (1995) argue for a more complex model of At- 
tentional State than is represented in most current 
computational theories of discourse. 

Many theories that address how Attentional State 
should be modeled have the goal of performing inten- 
tion recognition as well. We investigate performing 
temporal reference resolution directly, without also 
attempting to recognize discourse structure or inten- 
tions. We assess the challenges the data present to 
our model when only this task is attempted. 

We identified how far back on the focus list one 
must go to find an antecedent that is appropriate 
according to the model. Such an antecedent need 
not be unique. (We also allow antecedents for which 
the anaphoric relation would be a trivial extension 
of one of the relations in the model.) 

The results are striking. Between the two sets 
of data, out of 215 anaphoric references, there are 
fewer than 5% for which the immediately preceding 
time is not an appropriate antecedent. Going back 
an additional time covers the remaining cases. 
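
The measurement described above can be sketched as a search down a recency-ordered focus list. This is an illustration only: the `TemporalUnit` fields and the `compatible` test are hypothetical stand-ins for the anaphoric relations of the model, which are richer than a simple field-agreement check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalUnit:
    # Hypothetical, simplified fields; None means the field is unspecified.
    month: Optional[int] = None
    date: Optional[int] = None
    hour: Optional[int] = None


def compatible(antecedent, anaphor):
    """Toy appropriateness test: fields specified in both units must agree."""
    for f in ("month", "date", "hour"):
        a, b = getattr(antecedent, f), getattr(anaphor, f)
        if a is not None and b is not None and a != b:
            return False
    return True


def depth_of_first_antecedent(focus_list, anaphor, ok=compatible):
    """focus_list is ordered most recent first; return the 1-based depth of the
    first appropriate antecedent, or None if no entry qualifies."""
    for depth, candidate in enumerate(focus_list, start=1):
        if ok(candidate, anaphor):
            return depth
    return None


focus = [TemporalUnit(month=3, date=12), TemporalUnit(month=3, date=5)]
print(depth_of_first_antecedent(focus, TemporalUnit(hour=14)))
print(depth_of_first_antecedent(focus, TemporalUnit(date=5, hour=14)))
```

A depth of 1 corresponds to the immediately preceding time being an appropriate antecedent; a depth of 2 corresponds to going back one additional time.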

The model is geared toward allowing the most recent
Temporal Unit to be an appropriate antecedent.
For example, in the example for anaphoric relation 0,
the second utterance (as well as the first) is a possible
antecedent of the third. A corresponding speech
act analysis might be that the speaker is suggesting
a modification of a previous suggestion. Considering
the most recent antecedent as often as possible
supports robustness, in the sense that more of the
dialog is considered.

There are subdialogs in the NMSU data (but
none in the CMU data) for which our recency algorithm
fails because it lacks a mechanism for recognizing
subdialogs. There are five temporal references
within subdialogs that recency either incorrectly
interprets to be anaphoric to a time mentioned before
the subdialog or incorrectly interprets to be the
antecedent of a time mentioned after the subdialog.
Fewer than 25 cumulative errors result from these
primary errors. In the case of one of the primary
errors, recency commits a "self-correcting" error; without
this luck, the remainder of the dialog would have
represented additional cumulative error.

In a departure from the algorithm, the system uses
a simple heuristic for ignoring subdialogs: a time is
ignored if the utterance evoking it is in the simple
past or past perfect. This prevents a number of the
above errors and suggests that changes in tense, aspect,
and modality are promising clues to explore
for recognizing subdialogs in this kind of data (cf.,
e.g., Grosz & Sidner 1986; Nakhimovsky 1988). The
CMU data has very little variation in tense and aspect,
which is why a mechanism for interpreting them
was not incorporated into the algorithm.
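
The tense heuristic can be sketched as a filter applied when the focus list is updated. The `Utterance` structure, the tense labels, and the string encoding of times are assumptions made for illustration; the tense of each utterance is assumed to be supplied by an upstream parser.

```python
from dataclasses import dataclass
from typing import Optional

# Tenses whose times are ignored, following the heuristic described above.
IGNORED_TENSES = {"simple_past", "past_perfect"}


@dataclass
class Utterance:
    text: str
    tense: str                  # hypothetical label from an upstream parser
    time: Optional[str] = None  # the time evoked by the utterance, if any


def update_focus_list(focus_list, utterance):
    """Return a new most-recent-first focus list. Times evoked by utterances
    in the simple past or past perfect are not added."""
    if utterance.time is None or utterance.tense in IGNORED_TENSES:
        return list(focus_list)
    return [utterance.time] + list(focus_list)


dialog = [
    Utterance("How about Monday at two?", "present", "Mon 14:00"),
    Utterance("I had met him last Friday.", "past_perfect", "last Fri"),
    Utterance("Three would be better.", "conditional", "Mon 15:00"),
]
focus = []
for u in dialog:
    focus = update_focus_list(focus, u)
print(focus)
```

In this toy dialog the past perfect utterance never enters the focus list, so the time it evokes cannot become the antecedent of a later reference.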

Rose et al. (1995) report that "multiple threads", 
when the participants are negotiating separate 
times, pose challenges to a stack-based discourse 
model on both the intentional and attentional levels. 
They posit a more complex representation of Atten- 
tional State to meet these challenges. They report 
improved results on speech-act resolution in a corpus 
of scheduling dialogs. 

Here, we focus on just the attentional level. The
structure relevant for the task addressed in this paper
is the following, corresponding to their figure
2. There are four Temporal Units mentioned in the
order TU1, TU2, TU3, TU4 (other times could be
mentioned in between). The (attentional) multiple
thread case is when TU1 is required to be an antecedent
of TU3, but TU2 is also needed to interpret
TU4. Thus, TU2 cannot be simply thrown away or
ignored once we are done interpreting TU3. This
structure would definitely pose a difficult problem
for our algorithm, but there are no realizations, in
terms of our model, of this structure in the data we
analyzed.
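
One way to look for this structure in annotated data is to test for crossing antecedent links. The sketch below assumes a hypothetical annotation format: a mapping from the position of each Temporal Unit to the position of the antecedent it requires. It is not the annotation scheme used in this work.

```python
def has_multiple_threads(antecedent_of):
    """antecedent_of maps the position of a Temporal Unit to the position of
    the antecedent it requires. Return True if two links cross, i.e. there
    are positions a1 < a2 < n1 < n2 where n1 requires a1 and n2 requires a2."""
    links = [(ante, ana) for ana, ante in antecedent_of.items()]
    for a1, n1 in links:
        for a2, n2 in links:
            if a1 < a2 < n1 < n2:
                return True
    return False


# TU3 requires TU1 and TU4 requires TU2: the multiple thread case.
print(has_multiple_threads({3: 1, 4: 2}))
# Nested links (TU3 requires TU2, TU4 requires TU1) do not cross.
print(has_multiple_threads({3: 2, 4: 1}))
# A simple chain, where each unit requires the one before it.
print(has_multiple_threads({2: 1, 3: 2, 4: 3}))
```

Only the first configuration is the crossing pattern. The nested and chained cases do not match it.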

The different findings might be due to the fact
that different problems are being addressed. Having
no intentional state, our model does not distinguish
times being negotiated from other times. It
is possible that another structure is relevant for the
intentional level: Rose et al. (1995) do not specify
whether or not this is so. The different findings may
also be due to differences in the data: although their
scheduling dialogs were collected under similar protocols,
their protocol is like a radio conversation in
which a button must be pressed in order to transmit,
resulting in less dynamic interaction and longer
turns (Villa 1994).

An important discourse feature of the dialogs is
the degree of redundancy of the times mentioned
(Walker 1996). This limits the ambiguity of the
times specified, and it also leads to a higher level of
robustness, since additional DE's with the same time
are placed on the focus list. These "backup" DE's
might be available in case the rule applications fail
on the most recent DE. Table 6 presents measures
of redundancy. For illustration, the redundancy is
broken down into the case where redundant plus additional
information is provided ("redundant") versus
the case where the temporal information is just
repeated ("reiteration"). This shows that roughly
25% of the CMU utterances with temporal information
contain redundant temporal references, while
20% of the NMSU ones do.
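
The distinction between "redundant" and "reiteration" can be illustrated with a small sketch. The representation of a time as a dictionary of specified fields and the matching criterion are assumptions made for illustration; they are not the procedure used to produce the reported figures.

```python
def classify_repetition(new, previous):
    """new and previous map specified temporal fields to values.
    'reiteration': every field of new repeats a value in previous.
    'redundant':   new repeats some value and also supplies other information.
    'none':        new repeats nothing from previous."""
    shared = {f for f, v in new.items() if previous.get(f) == v}
    if not shared:
        return "none"
    if shared == set(new):
        return "reiteration"
    return "redundant"


def redundancy_rate(units):
    """units are in dialog order; return the share of units that repeat
    information from at least one earlier unit."""
    repeated = 0
    for i, tu in enumerate(units):
        if any(classify_repetition(tu, prev) != "none" for prev in units[:i]):
            repeated += 1
    return repeated / len(units) if units else 0.0


# Hypothetical sequence of times evoked in a dialog.
units = [{"date": 5}, {"date": 5, "hour": 14}, {"hour": 9}, {"date": 5}]
print(classify_repetition(units[1], units[0]))
print(classify_repetition(units[3], units[0]))
print(redundancy_rate(units))
```

In this toy sequence the second unit repeats the date and adds an hour, and the fourth unit only repeats the date.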

7 Conclusions 

This paper presented an intercoder reliability study
showing strong reliability in coding the temporal
information targeted in this work. A model of temporal
reference resolution in scheduling dialogs was
presented which supports linear recency and has
very good coverage, and an algorithm based on the
model was described. The analysis of the detailed results
showed that the implemented system performs
quite well (for instance, 81% accuracy vs. a lower
bound of 43% on the unseen CMU test data).

We also assessed the challenges presented by the 
data to a method that does not recognize discourse 
structure, based on an extensively annotated corpus 
and our experience developing a fully automatic sys- 
tem. In an overwhelming number of cases, the last 
mentioned time is an appropriate antecedent with 
respect to our model, in both the more and the less 
constrained data. In the less constrained data, some 
error occurs due to subdialogs, so an extension to 
the approach is needed to handle them. But in none 
of these cases would subsequent errors result if, upon 
exiting the subdialog, the offending information were 
popped off a discourse stack or otherwise made in- 
accessible. Changes in tense, aspect, and modality 
are promising clues for recognizing subdialogs in this 
data, which we plan to explore in future work. 